Name: Harsh, Ruiqi

1. Import

Pandas Profiling Report:

2. Visualizing the Data

3. Data Processing

3.1 Data Cleaning

Remove the last two rows, in which all columns are NaN.

Set the outliers as NaN.

Use the interpolate() method to fill missing values in the time-series data.
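A minimal sketch of these cleaning steps, using a toy frame in place of "data.csv"; the single `y` column and the fixed outlier threshold of 100 are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data.csv; "y" is the series of interest.
df = pd.DataFrame({"y": [10.0, 11.0, 500.0, 12.0, np.nan, np.nan]})

# Drop the trailing rows in which every column is NaN.
df = df.dropna(how="all")

# Mark outliers as NaN (a simple fixed threshold stands in for
# whatever outlier rule the notebook actually used).
df["y"] = df["y"].mask(df["y"] > 100)

# Fill the gaps by linear interpolation, appropriate for time-series data.
df["y"] = df["y"].interpolate(method="linear", limit_direction="both")
```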

Pandas Profiling Report (after data cleaning):

Convert the y column on a given observation day into a NumPy array.
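For example (the dates and values here are placeholders, not the real data):

```python
import pandas as pd

# Toy daily series; in the notebook this is the cleaned data.csv frame.
df = pd.DataFrame(
    {"y": [71.0, 72.5, 73.0]},
    index=pd.to_datetime(["2021-02-26", "2021-02-27", "2021-02-28"]),
)

# The y column as a NumPy array, ready for windowing and model input.
y = df["y"].to_numpy()
```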

3.2 Train, Validation, and Test Splits

Train (70%), validation (15%), and test (15%) splits.
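Because this is time-series data, the split should be chronological rather than shuffled; a sketch with a stand-in series:

```python
import numpy as np

y = np.arange(100, dtype=float)  # stand-in for the cleaned series

n = len(y)
n_train = int(n * 0.70)
n_val = int(n * 0.15)

# Chronological split: no shuffling, so the test set is strictly "in the future".
train = y[:n_train]
val = y[n_train:n_train + n_val]
test = y[n_train + n_val:]
```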

3.3 Normalizing the Data

Reshape the 1-D data into a 2-D array, as the scaler expects.
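A sketch of this step, assuming scikit-learn's MinMaxScaler (which requires a 2-D `(n_samples, n_features)` input):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([10.0, 20.0, 30.0])  # 1-D stand-in for the training series

# MinMaxScaler expects a 2-D (n_samples, n_features) array, so reshape first.
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train.reshape(-1, 1))
```

The scaler should be fit on the training split only and then applied to the validation and test splits, to avoid leaking future information.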

3.4 Creating Sequences
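Sequence creation can be sketched as a sliding window: each sample is `look_back` consecutive values, and the target is the next value (the helper name and window length here are illustrative):

```python
import numpy as np

def create_sequences(series, look_back):
    """Turn a 1-D series into (samples, look_back) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - look_back):
        X.append(series[i:i + look_back])
        y.append(series[i + look_back])
    return np.array(X), np.array(y)

X, y = create_sequences(np.arange(10, dtype=float), look_back=3)
```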

4. Regular feedforward neural network

4.1 Define the model

If the number of neurons is too large and there are too many hidden layers, the model will overfit; conversely, too few neurons will cause underfitting. We noticed that setting the number of neurons equal to the look-back step reduced the test-set error.
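A minimal Keras sketch of such a network, assuming a look-back window of 3 (the actual architecture, activations, and optimizer in the notebook may differ):

```python
from tensorflow import keras
from tensorflow.keras import layers

look_back = 3  # assumed window length

# Small feedforward net; the hidden width matches the look-back step,
# the sizing the text reports reduced test-set error.
model = keras.Sequential([
    layers.Input(shape=(look_back,)),
    layers.Dense(look_back, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```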

4.2 Fitting the model
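Fitting can be sketched with a Keras `EarlyStopping` callback, which implements the "stop before overfitting" rule discussed with the error plots; the synthetic data, patience, and epoch count here are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Synthetic stand-in data with a look-back window of 3.
rng = np.random.default_rng(0)
X = rng.random((64, 3))
y = rng.random(64)

model = keras.Sequential([
    layers.Input(shape=(3,)),
    layers.Dense(3, activation="relu"),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# EarlyStopping halts training once validation loss stops improving.
stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)
history = model.fit(X, y, validation_split=0.2, epochs=20,
                    callbacks=[stop], verbose=0)
```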

4.3 Error Plot

From the error plot, training should stop at about epoch 50 to prevent overfitting.

4.4 Performance on Test Set

4.5 Predicted values of y for March 1st and March 2nd

5. RNN

5.1 Preparing the data

5.2 Define the GRU model

If the number of neurons is too large and there are too many hidden layers, the model will overfit; conversely, too few neurons will cause underfitting. We noticed that setting the number of neurons to the look-back step × the number of foresight steps reduced the test-set error.
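A minimal Keras sketch of such a GRU, assuming a look-back of 3, a foresight of 2 steps, and a single input feature (all placeholders for the notebook's actual values):

```python
from tensorflow import keras
from tensorflow.keras import layers

look_back, foresight, n_features = 3, 2, 1  # assumed values

# GRU whose hidden width is look_back * foresight, the sizing the text
# reports worked best; the output has one value per forecast step.
model = keras.Sequential([
    layers.Input(shape=(look_back, n_features)),
    layers.GRU(look_back * foresight),
    layers.Dense(foresight),
])
model.compile(optimizer="adam", loss="mse")
```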

5.3 Fitting the GRU

5.4 Error Plot

From the error plot, training should stop at about epoch 7 to prevent overfitting.

5.5 Performance on Test Set

5.6 Predicted values of y for March 1st and March 2nd

6. 1d Convolutional Neural Network

6.1 Preparing the data

6.2 Normalizing the Data

Reshape the 1-D data into a 2-D array, as the scaler expects.

6.3 Creating Sequences

6.4 Define the model

If the number of neurons is too large and there are too many hidden layers, the model will overfit; conversely, too few neurons will cause underfitting. We noticed that setting the number of filters to 64 in the first layer and to the look-back step in the second layer reduced the test-set error.
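A minimal Keras sketch of such a Conv1D stack, assuming a look-back of 3, one input feature, and a kernel size of 2 (all illustrative placeholders):

```python
from tensorflow import keras
from tensorflow.keras import layers

look_back, n_features = 3, 1  # assumed values

# Conv1D stack: 64 filters in the first layer and look_back filters in
# the second, matching the sizing described above.
model = keras.Sequential([
    layers.Input(shape=(look_back, n_features)),
    layers.Conv1D(64, kernel_size=2, activation="relu", padding="same"),
    layers.Conv1D(look_back, kernel_size=2, activation="relu", padding="same"),
    layers.Flatten(),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```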

6.5 Fit the model

6.6 Error Plot for Conv1D

From the error plot, training should stop at about epoch 58 to prevent overfitting.

6.7 Performance on Test Set

6.8 Predicted values of y for March 1st and March 2nd

7. VAR

7.1 Preparing the Data

7.2 Imports

7.3 Define a VAR model

7.4 Determine the optimum lag order (p) by fitting the model at different lag orders and selecting the lowest AIC

The optimum lag order is p = 1.

7.5 Model fit at optimum lag order and get results

The process is stable (stationary) since the absolute values of all characteristic roots are greater than 1, i.e., they lie outside the unit circle.

7.6 Investigate Granger causality between series combinations (if any)

7.7 Predicted values of y for March 1st and March 2nd

8. Conclusion

After assessing the different models for predicting y from the "data.csv" file, we chose the 1-D convolutional neural network. Among the error plots, this model has the lowest test error, around 5.7. Using it, we predict y = 74 on March 1st and y = 73 on March 2nd.